Skip to content

Fix groupby any/all on null-containing string columns#22926

Merged
rapids-bot[bot] merged 3 commits into
rapidsai:mainfrom
galipremsagar:groupby_string_bool_nulls
Jun 23, 2026
Merged

Fix groupby any/all on null-containing string columns#22926
rapids-bot[bot] merged 3 commits into
rapidsai:mainfrom
galipremsagar:groupby_string_bool_nulls

Conversation

@galipremsagar

@galipremsagar galipremsagar commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Description

GroupBy.all / GroupBy.any returned wrong results for string columns containing nulls. Two issues in GroupBy._bool_reduce's string→bool coercion:

  1. Lost missingness for na_value=np.nan string dtypes. count_characters() > 0 does not propagate the source null (a NaN-flavored missing compares as False), so a missing value was treated as a real empty/falsy string. The source column's null positions are now restored on the bool column.
  2. astype(np.bool_) on a null-containing extension bool. Casting a null-containing pandas-extension bool to numpy bool is (intentionally) rejected in pandas-compatible mode. The bool column is now kept nullable through the min/max aggregation and normalized to np.bool_ only after empty/skipna groups are filled.

As a result, all over an all-null group is now vacuously True and any is False (with nulls dropped under skipna=True), matching pandas across all four StringDtype variants (python/pyarrow × na_value=pd.NA/np.nan).

This fixes tests/groupby/test_reductions.py::test_string_dtype_all_na (40 parametrizations) in the cudf.pandas pandas test suite, and incidentally test_any and test_groupby_bool_aggs[True-any-vals12]; their xfail entries are dropped.

Out of scope

test_masked_kleene_logic[any-False-...] remains xfailed: a mixed True/False/NA group under skipna=False requires full three-valued (Kleene) NA propagation (e.g. False OR NA -> NA), which is a larger change than the all-null / all-valued cases addressed here.

Tests

Added test_groupby_any_all_string_nulls (cuDF unit test) covering any/all × skipna over null and valued string groups across all four StringDtype variants. It fails without this change.

Verification

  • pandas-testing/.../test_reductions.py under cudf.pandas (with plugin): 1243 passed, 121 xfailed, 0 failed, 0 XPASS.
  • Full cuDF groupby unit suite: no regressions.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@galipremsagar galipremsagar requested a review from a team as a code owner June 18, 2026 22:27
@galipremsagar galipremsagar requested review from Matt711 and wence- June 18, 2026 22:27
@github-actions github-actions Bot added Python Affects Python cuDF API. cudf.pandas Issues specific to cudf.pandas labels Jun 18, 2026
@GPUtester GPUtester moved this to In Progress in cuDF Python Jun 18, 2026
@galipremsagar galipremsagar added bug Something isn't working breaking Breaking change labels Jun 18, 2026
@coderabbitai

coderabbitai Bot commented Jun 18, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 72243b23-96a2-41a1-8f9d-46e4384395cf

📥 Commits

Reviewing files that changed from the base of the PR and between 92d6582 and e74de79.

📒 Files selected for processing (1)
  • python/cudf/cudf/pandas/scripts/pandas-testing-plugin.py
✅ Files skipped from review due to trivial changes (1)
  • python/cudf/cudf/pandas/scripts/pandas-testing-plugin.py

📝 Walkthrough

Summary by CodeRabbit

  • Bug Fixes

    • Fixed groupby().any() / groupby().all() to preserve missing-value behavior more accurately, especially with skipna=False.
    • Ensured any/all result dtypes match pandas across nullable and Arrow-backed input types, including for empty/all-missing groups.
  • Tests

    • Added new groupby reduction coverage for any/all with string nulls and for pandas-compatible output dtypes across multiple dtype “flavors”.
    • Updated existing test expectation mappings to reflect corrected behavior and error messages.

Walkthrough

GroupBy._bool_reduce's _to_bool_col helper is updated to restore the source column's null mask after boolean coercion when null counts diverge, fill missing values with True under skipna=False, and defer casting to np.bool_ until nulls are absent. A new _bool_result_dtype function selects output dtypes matching input flavor (Arrow-backed to pd.ArrowDtype(pa.bool_()), pandas nullable to pd.BooleanDtype(), others to np.bool_), applied to both Series and DataFrame results. Two parametrized tests validate any/all on grouped nullable strings and confirm output dtype matching across dtype flavors. Pandas-testing failure expectations are updated accordingly.

Changes

GroupBy boolean reduction null mask and dtype fix

Layer / File(s) Summary
Import and _to_bool_col null mask fix
python/cudf/cudf/core/groupby/groupby.py
Adds is_pandas_nullable_extension_dtype import and refines _to_bool_col to reapply the source null mask when null counts diverge after boolean coercion, fill remaining nulls with True for skipna=False, and defer np.bool_ cast until no nulls remain.
Output dtype selection and application
python/cudf/cudf/core/groupby/groupby.py
Introduces _bool_result_dtype to map input dtype flavor (Arrow-backed, pandas nullable/masked integer-like, other) to corresponding output boolean dtypes, applies dtype selection when filling empty/all-null groups for Series results, and extends dtype-aware fillna and casting to DataFrame value columns.
New parametrized tests for string nulls and result dtype matching
python/cudf/cudf/tests/groupby/test_reductions.py
Updates copyright header and adds test_groupby_any_all_string_nulls covering grouped any/all over multiple StringDtype variants with skipna handling, and test_groupby_any_all_result_dtype validating output dtype matches pandas across nullable/pyarrow dtype flavors under pandas_compatible mode.
Pandas-testing failure expectation updates
python/cudf/cudf/pandas/scripts/pandas-testing-plugin.py
Removes now-passing test_any and test_groupby_bool_aggs[True-any-vals12] from failure map, removes large block of test_string_dtype_all_na failures, adds TODO placeholders for test_first_last_skipna and multiple reduction cases, and updates expectations for test_basic_aggregations, test_indices_with_missing, extrema tests, and test_sum_skipna variants.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • rapidsai/cudf#22813: Modifies the same GroupBy._bool_reduce boolean-coercion and casting path, skipping grouping-key columns to prevent incorrect casting — directly adjacent to this PR's null-mask and dtype fix in the same function.

Suggested labels

improvement, non-breaking

Suggested reviewers

  • vyasr
  • mroeschke
  • bdice
  • brandon-b-miller
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main fix: handling null values in string columns for groupby any/all operations.
Description check ✅ Passed The description is well-related to the changeset, clearly explaining the two bugs fixed, the impact on behavior, and verification results.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@galipremsagar galipremsagar added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Jun 23, 2026
@vyasr

vyasr commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

/merge

@rapids-bot rapids-bot Bot merged commit d7d9751 into rapidsai:main Jun 23, 2026
126 of 127 checks passed
@github-project-automation github-project-automation Bot moved this from In Progress to Done in cuDF Python Jun 23, 2026
mhaseeb123 pushed a commit to mhaseeb123/cudf that referenced this pull request Jun 24, 2026
…2926)

`GroupBy.all` / `GroupBy.any` returned wrong results for string columns containing nulls. Two issues in `GroupBy._bool_reduce`'s string→bool coercion:

1. **Lost missingness for `na_value=np.nan` string dtypes.** `count_characters() > 0` does not propagate the source null (a NaN-flavored missing compares as `False`), so a missing value was treated as a real empty/falsy string. The source column's null positions are now restored on the bool column.
2. **`astype(np.bool_)` on a null-containing extension bool.** Casting a null-containing pandas-extension bool to numpy bool is (intentionally) rejected in pandas-compatible mode. The bool column is now kept nullable through the `min`/`max` aggregation and normalized to `np.bool_` only after empty/`skipna` groups are filled.

As a result, `all` over an all-null group is now vacuously `True` and `any` is `False` (with nulls dropped under `skipna=True`), matching pandas across all four `StringDtype` variants (`python`/`pyarrow` × `na_value=pd.NA`/`np.nan`).

This fixes `tests/groupby/test_reductions.py::test_string_dtype_all_na` (40 parametrizations) in the `cudf.pandas` pandas test suite, and incidentally `test_any` and `test_groupby_bool_aggs[True-any-vals12]`; their xfail entries are dropped.

### Out of scope
`test_masked_kleene_logic[any-False-...]` remains xfailed: a *mixed* True/False/NA group under `skipna=False` requires full three-valued (Kleene) NA propagation (e.g. `False OR NA -> NA`), which is a larger change than the all-null / all-valued cases addressed here.

### Tests
Added `test_groupby_any_all_string_nulls` (cuDF unit test) covering `any`/`all` × `skipna` over null and valued string groups across all four `StringDtype` variants. It fails without this change.

### Verification
- `pandas-testing/.../test_reductions.py` under `cudf.pandas` (with plugin): 1243 passed, 121 xfailed, 0 failed, 0 XPASS.
- Full cuDF groupby unit suite: no regressions.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: rapidsai#22926
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

5 - Ready to Merge Testing and reviews complete, ready to merge breaking Breaking change bug Something isn't working cudf.pandas Issues specific to cudf.pandas Python Affects Python cuDF API.

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

3 participants